Abstract


We investigate a novel global orientation regression approach for articulated objects
using a deep convolutional neural network. This is integrated with an in-plane image
derotation scheme, DeROT, to tackle the problem of per-frame fingertip detection in
depth images. The method reduces the complexity of learning in the space of articulated
poses, which we demonstrate using two distinct state-of-the-art learning-based
hand pose estimation methods applied to fingertip detection. Significant classification
improvements are shown over the baseline implementation. Our framework involves no
tracking, kinematic constraints or explicit prior model of the articulated object in hand.

To support our approach we also describe a new pipeline for high accuracy magnetic 
annotation and labeling of objects imaged by a depth camera. 

1 Introduction

In this paper we propose a method for normalizing out the effects of rotation on highly
articulated motion of deforming geometric surfaces such as hands observed by a depth camera.
Changing the global rotation of an object directly increases the variation in appearance of the
object parts. The work of [O] physically removes this variability with a wrist-worn camera
and samples only a single 3D point on each finger to perform full hand pose estimation. For
markerless situations, removing variability through partial canonization can significantly
reduce the space of possible images used for pose learning instead of trying to explicitly learn
the variability through data augmentation. In [113] the authors show that learning a derotated
2D patch instead of the original one around a feature point dramatically reduces the learning
capacity required and improves the classification results while using fewer randomized trees.
To develop our method we use fingertip detection as a challenging representative scenario
with a propensity for self-occlusion and high rotational variability relative to an imaging
sensor. Many approaches in the literature use fingertip or hand part detection towards the goal
of full hand pose (e.g. [0],[S],[IZ]]],[Q]); however, they all approach the problem by trying
to learn on datasets by augmenting rotational variability. Instead, we propose to remove this
hand space variability during both the training phase and run-time. To this end we propose to
learn the rotation using a deep convolutional neural network (CNN) in a regression context
based on a network similar to that of [123]]. We show how this can be used to predict full three
degrees of freedom (DOF) orientation information on a database of hand images captured
by a depth sensor. We combine the predicted orientation with a novel in-plane derotation
scheme. The "Rule of thumb" is derived from the following insight: there is almost always
an in-plane rotation which can be applied to an image of the hand which forces the base of
the thumb to be on the right side of the image. This implies that the ambiguity inherent in
rotationally variant features can be overcome by derotating the hand image to a canonical
pose instead of augmenting a dataset with all variations of the rotational degrees of freedom
as is commonly done. Figure 1 shows examples of extensive pose variation that can benefit
from our approach*.

Figure 1: Examples from HandNet test set detections. The colors represent fingertips that are
correctly located and identified. The white boxes indicate false detections with the error
threshold chosen to be 1cm. The top two rows are trained and tested on non-derotated data. The
bottom two are trained and tested on derotated data and then rotated back to the non-derotated
space. The detections are overlaid on the IR image from the camera which is not part of the
classification process. a) Successful examples for all methods. b) Representative challenging
examples for which derotation enables better performance. c) Failure cases where derotation
fails to improve the results.

No currently available hand datasets (e.g. [E3],[Q],[I23]]) include accurate full 3 DOF
ground truth hand orientations on a large database of real depth images. Using joint
location data from NYUHands [123]] it is possible to extract a global hand orientation per pose.
However, we found that the size of this dataset and rotational variability are not optimal for
learning to predict 3 DOF orientation. A significant contribution of this paper is therefore the
creation of a new, large-scale database of fully annotated depth images with 212928 unique
hand poses captured by an Intel RealSense camera that we call HandNet^. For the purpose of
effectively annotating such a large dataset we describe a novel image annotation technique.
To overcome the severe occlusion inherent in such a process we use DC magnetic trackers
which are surprisingly sparsely used by the vision community considering their high
accuracy, speed and robustness to occlusions. Using our deep derotation method (DeROT) we
show up to 20.5% improvement in mean average precision (mAP) over our baseline results
for two state-of-the-art approaches for fingertip detection in depth images, namely, a random
decision tree [O] (RDT) and a deep convolutional neural network [EH] (CNN). We also
compare our results to a non-learning based method similar to PCA and show that it produces
inferior results, further supporting the proposed use of DeROT.


2 Building HandNet: Creation and annotation 

Synthetic databases such as those created using [123] have a severe disadvantage in that they
cannot accurately account for natural hand motion, occlusions and noise characteristics of
real depth cameras. The creation of a large hand pose database of real depth images with
consistent annotations is therefore of great importance, but beyond the capability of human
annotators. The NYUHands database [EUl] uses a full model of the hand and a three-camera
setup to annotate hand joint locations. There are instances where fingers are obstructed and
accurate orientation information is not reliable. Similarly the method of [E3] uses inverse
kinematics coupled with a colored glove which also has the disadvantage of not having
explicitly measured orientation as well as fingertip locations which are obstructed from the
depth camera. An alternative to model based systems are sparse marker systems such as
those used by [E3]; however, the excessive cost of a modern mocap setup such as Vicon
as well as the occlusion problem make such an approach unattractive. In contrast, modern
DC magnetic trackers like the TrakStar [ID] are robust to metallic interference and
obstruction by non-ferrous metals, and provide sub-millimeter and sub-degree accuracy for
location and orientation relative to a fixed base station. Despite their almost non-existent use
in modern computer vision literature, we have found them to be an excellent measurement and
annotation tool.

Figure 2: The data capture setup. a) 2mm magnetic sensors. The larger rectangular sensors are
not used. b) A fingertip sensor inside the inner seam. c) Virtual model used for planning a
multi-sensor setup. We only use 5 sensors. d) The RealSense camera rigidly fixed to the
TrakStar transmitter. e) The back of the wooden calibration board where the glass sensor
housings are firmly pushed through. f) The front of the calibration board where the glass
sensor housings are visible on the corners as seen in the inset.

Figure 3: The available data annotations after calibration. a) Color image. Illustrates a full
hand setup for this work. The color is not used. b) The RGB axes indicate the measured location
and orientation of each fingertip and the back of the palm. c) IR image (not used) overlaid
with the labels generated from the raycasting described in Section 2. d) IR image overlaid
with the generated heatmaps per fingertip and the global orientation of the hand represented
as an oriented bounding box (not used).

* All graphs and images in this paper are best viewed in color.

^ To advance research in the field this database and relevant code is available at
www.cs.technion.ac.il/~tward/HandNet/

Sensors. To build and annotate our HandNet database we use a RealSense camera combined
with 2mm TrakStar magnetic trackers. We affix the sensors to a user's hand and fingertips
by using tight elastic loops with sensors in sewn seam pockets. This prevents lateral
and medial movement along the finger. This can be seen in Figure 2. The skin tight elastic
loops have an additional significant benefit over gloves in that the depth profile and hand
movements are not affected by the attached sensors and thus do not pollute the data.

Calibration. Camera calibration with known correspondences is a well studied problem
[IZl]. However, in our case we need to calibrate between a camera and a sensor frame.
We do this by positioning the magnetic sensors on the corners of a checkerboard pattern
thereby creating physical correspondence between the detected corner locations and the
actual sensors. This setup can be seen in Figure 2. We use the extracted 2D locations of the
corner points on the calibration board [S] together with the sampled sensor 3D locations to
perform EPnP [O] to determine the extrinsic configuration between the devices.












WETZLER ET AL.; DEEP DEROTATION FOR IMPROVED FINGERTIP DETECTION 



[Figure 4 panels: (I) trained on non-derotated, tested on non-derotated; (II) trained on
non-derotated, tested on derotated; (III) trained on derotated, tested on non-derotated;
(IV) trained on derotated, tested on derotated.]


Figure 4: Understanding derotation: We represent the space of poses by a non-uniform 2D region
with representative hand poses. Red and green represent the pose-space covered by training
images and testing images respectively. Each of I, II, III, IV indicates one of the 4 possible
combinations of training and testing for a machine learning method where the database remains
fixed in size. The larger region indicates greater pose variability while the smaller represents
less. Intuitively, by training on a space with low variance and testing in this same space
(type IV) we expect to see an improvement over the opposite (type I). Section 5.1 supports
this intuition.


Annotation. We model each sensor as a 3D oriented ellipsoid. We then raycast the
ellipsoid into the camera frame and set the label to be the identity of the ellipsoid closest to
the camera for every pixel. We also create a heatmap $h_i$ for each fingertip $i$ using the same
technique but setting the value per pixel to be Gaussian over the distance to the projected
sensor location. An example of both types of annotation can be seen in Figure 3.
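The heatmap annotation described above can be illustrated with a minimal sketch. The grid size, the projected sensor location and the value of `sigma` are hypothetical choices for illustration only; the text does not state the Gaussian width that was used.

```python
import math

def gaussian_heatmap(width, height, center, sigma):
    """Return a height x width grid whose value falls off as a Gaussian
    of the pixel distance to `center` = (cx, cy)."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

# Hypothetical example: an 18 x 18 map with the sensor projected to pixel (5, 9).
heatmap = gaussian_heatmap(width=18, height=18, center=(5.0, 9.0), sigma=1.5)
```

The map peaks at the projected sensor location and decays symmetrically around it, which is the property the heatmap-based training objective relies on.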

Recording the database. The database is created from 10 participants (half male, half
female, different hand sizes) who perform random hand motions with extensive pose
variation while wearing the magnetic sensors. The RealSense camera operates at 58fps producing
640 x 480 depth maps which we reduce to 320 x 240. The TrakStar samples measurements
at a rate of 720Hz. In total we recorded 256987 images. A portion of these images were
removed because of low quality. The final dataset is 212928 frontal images including full
annotation of the position and orientation of each fingertip and the back of the palm.
After recording each participant we used a software utility to add offsets to the rotation and
location of each sensor to adjust for greater consistency in positioning across subjects.


3 Fingertip detection 

Although there are many non-learning based hand pose methods that can produce fingertip
locations (e.g. [□, CB, D, 113]), they use kinematic and frame-to-frame constraints coupled
with hand modelling. In contrast, here we specifically focus on per-frame fingertip detection
in depth images without either tracking or kinematic modelling. For our pipeline we first
segment the target hand from the depth image using a fast depth-based flood-fill method
seeded either from the previous frame for real-time use and testing or from the ground truth
hand location for building the database. Using the center of mass (CoM) of the segmented 
hand and its average depth value we define a depth dependent bounding box of size w = 5*1^ 
for a RealSense camera (HandNet) and w = 30^ for a Kinect camera (NYUHands) where 
z is the depth of the CoM of the segmented hand. We derotate the image about the CoM 
using an angle of rotation according to the in-plane angle produced by DeROT described in 
Section 4. This comes from the predicted full 3D orientation at run-time or from the ground 
truth sensor orientation for database construction or testing. We then crop the image using 
the bounding box. We now describe our modifications of the two different, learning-based 
fingertip detectors that we use in this work. 
3.1 Random decision tree

We follow the method of Keskin et al. [O] where a random decision tree (RDT) ensemble
learns hand part labels for every pixel in a depth image of a hand. We refer the reader to the
supplementary material of our paper as well as [O, O] for specific details of this approach.
However, here we propose a number of key differences which we found specifically helpful
for fingertip detection and run-time efficiency. We use the same random binary depth
attributes per pixel but spatially distribute them according to an exponential sampling pattern
similar to that of BRISK [ED]. In addition to this, we use only a single RDT which contrasts
with the common use of multiple trees in an ensemble. After training our single RDT the
class distributions stored at each leaf can be used for inference because they represent the
empirical estimate of the posterior probability $p(c \mid x)$ of hand part label $c$ given the
image evidence $x$. Inferring the most likely fingertip identity label is therefore simply
performed pixel-wise by finding the $c^*$ which maximizes $p(c \mid x)$ per pixel. However, label
inference performed this way results in noisy labels as neighboring classifications do not
influence one another. Without adding more trees we propose a simple but highly effective
spatial regularization: for each fingertip $i$ we treat the posterior $p(c = i \mid x)$ for all
pixels as an image and convolve it with a discrete 2D Gaussian smoothing kernel $g_\sigma$ with
blur radius $\sigma$. This has the effect of correlating the posterior label distributions of
nearby pixels. Therefore every pixel $q$ is labeled by fingertip identity (including palm and
wrist labels) according to

\[ c^*(q \mid x) = \arg\max_{i \in \{0..6\}} \left( g_\sigma * p_{c=i \mid x} \right)(q). \tag{1} \]

Finally, we found that the close proximity of fingers compromises standard mean-shift [Q]
clustering. Instead we detect the largest label blobs in the label image from Equation 1 above
a certain area threshold. The 2D fingertip locations are then assigned to the blob centers and,
if necessary, the average depth value for each blob can be used to generate the 3D
camera-space coordinates.
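As an illustration of the spatial regularization in Equation 1, the following sketch blurs each per-class posterior image and then takes the per-pixel argmax. It is a plain-Python toy, not the authors' implementation: the kernel radius, the zero padding at the borders and the function names are assumptions made for this example.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalised 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(image, sigma, radius):
    """Separable Gaussian blur with zero padding at the borders (an assumption)."""
    kernel = gaussian_kernel(sigma, radius)
    h, w = len(image), len(image[0])

    def pass_1d(img, horizontal):
        out = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                acc = 0.0
                for k, wk in enumerate(kernel):
                    yy = y if horizontal else y + k - radius
                    xx = x + k - radius if horizontal else x
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += wk * img[yy][xx]
                out[y][x] = acc
        return out

    return pass_1d(pass_1d(image, True), False)

def label_from_posteriors(posteriors, sigma=1.0, radius=2):
    """posteriors[i][y][x] holds p(c = i | x) at pixel (x, y).
    Returns the label map of Equation 1: argmax over the smoothed posteriors."""
    smoothed = [blur(p, sigma, radius) for p in posteriors]
    h, w = len(posteriors[0]), len(posteriors[0][0])
    return [[max(range(len(smoothed)), key=lambda i: smoothed[i][y][x])
             for x in range(w)] for y in range(h)]

# Toy example with two classes: one isolated pixel disagrees with its neighbours.
p0 = [[0.6] * 5 for _ in range(5)]
p0[2][2] = 0.4
p1 = [[1.0 - v for v in row] for row in p0]
labels = label_from_posteriors([p0, p1])
```

In this toy case a raw per-pixel argmax would label the centre pixel as class 1, whereas the smoothed posteriors assign it the same label as its neighbours, which is the noise-suppression effect the text describes.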

Training the RDT. Training optimal decision trees is known to be NP-complete [0] and
therefore trees are built from the root down using breadth-first greedy optimization over tree
node impurity. We use the Gini impurity measure which is slightly cheaper to compute than
the more typical entropy measure. To build our database for training an RDT we extracted
80% of the fingertip pixels in our training datasets and 50% of the non-fingertip hand pixels.
For HandNet this results in a training dataset of 500 million sample pixels totaling 600GB
of data for 1200 attributes. Our tree-builder trains an unpruned randomized tree on 4x GTX
580 GPUs and an Intel i7 processor with 48GB of RAM in 16 hours for a tree depth of
21 with 18000 query tests per node. We are not aware of another single-workstation
tree-builder capable of handling this quantity of data. The very large number of examples helps
to prevent the overfitting typically exhibited by single RDTs.
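For readers unfamiliar with the impurity criterion mentioned above, here is a small sketch of Gini impurity and the impurity reduction of a candidate split. The function names and the toy labels are illustrative assumptions; the actual GPU tree-builder is not described in enough detail to reproduce.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gain(parent, left, right):
    """Impurity reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Toy node holding two equally frequent hand-part labels.
parent = ["thumb"] * 4 + ["palm"] * 4
gain = split_gain(parent, parent[:4], parent[4:])
```

A greedy builder would evaluate such a gain for each candidate query test at a node and keep the best one. Unlike entropy, the Gini measure needs no logarithm, which is consistent with the remark that it is slightly cheaper to compute.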

3.2 Convolutional neural network 

Figure 5: This graph shows the predicted value of all 9 coefficients of the hand orientation
matrix in red relative to the ground truth in yellow. For clarity we order each ground truth
coefficient monotonically and apply this reordering to the predicted results. The mean squared
error for all the coefficients on the HandNet test set before and after SVD is 0.0271 and 0.0234
respectively.

For our second evaluated method we build a CNN architecture based on Tompson et al.
[Em] to predict the location of the five fingertips by using the maximum location in a set of
heat maps which implicitly represent fingertip locations. We refer the reader to that work
for specific details and to our supplementary material for the explicit architecture of our
implementation. This multi-layer deep approach is critical for an input space as complicated
as the set of images of an articulated object and we found that the deeper convolutional layers
extract feature responses on a higher semantic level such as oriented fingertips. Using the
heatmap based error objective helps to spatially regularize the network during training. For
input to the CNN we set $D_1$ to be the cropped depth resized to 96 x 96 pixels. We then
downsample it by a factor of two twice to produce $D_2$ and $D_3$. We use a subtractive form
of local contrast normalization (LCN) [DU, EU] so that $D_i \leftarrow D_i - g_\sigma * D_i$ using
a Gaussian smoothing kernel with $\sigma = 5$ pixels. The triplet $(D_1, D_2, D_3)$ is then input
to the network. The trained network outputs a heatmap $h_i$ per fingertip $i$ for new data. Our
method differs for fingertip detection in that we augment the output by a non-fingertip heatmap
that is strong wherever a fingertip is not likely to be present. Also, instead of fitting a
Gaussian model to the strongest mode in the low resolution heatmaps, we upsample each 18 x 18
fingertip heatmap $h_i$ to a fixed size of 128 x 128 with a smoothing bi-linear interpolator.
Similar to Section 3.1 every pixel $q$ is labeled with fingertip identity (including a
non-fingertip class) according to

\[ c^*(q) = \arg\max_{i \in \{0..5\}} h_i(q). \tag{2} \]

As in Section 3.1 the fingertip locations are given by the location of the largest label blob.
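The upsampling and labelling step around Equation 2 could be sketched as follows. This is a plain-Python illustration under stated assumptions: the interpolator uses corner-aligned sampling, which the text does not specify, and the function names are invented for the example.

```python
def upsample_bilinear(grid, out_size):
    """Resize a square 2D list (side >= 2) to out_size x out_size (out_size >= 2)
    with bilinear interpolation, keeping the corner values fixed."""
    n = len(grid)
    scale = (n - 1) / (out_size - 1)
    out = []
    for r in range(out_size):
        fy = r * scale
        y0 = min(int(fy), n - 2)
        ty = fy - y0
        row = []
        for c in range(out_size):
            fx = c * scale
            x0 = min(int(fx), n - 2)
            tx = fx - x0
            top = grid[y0][x0] * (1 - tx) + grid[y0][x0 + 1] * tx
            bottom = grid[y0 + 1][x0] * (1 - tx) + grid[y0 + 1][x0 + 1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        out.append(row)
    return out

def label_from_heatmaps(heatmaps, out_size=128):
    """Upsample every heatmap h_i and label each pixel with argmax_i h_i(q)."""
    up = [upsample_bilinear(h, out_size) for h in heatmaps]
    return [[max(range(len(up)), key=lambda i: up[i][r][c])
             for c in range(out_size)] for r in range(out_size)]

# Tiny example: two 2 x 2 heatmaps upsampled to 3 x 3.
small = upsample_bilinear([[0.0, 1.0], [2.0, 3.0]], 3)
labels = label_from_heatmaps([[[1.0, 0.0], [0.0, 0.0]],
                              [[0.0, 0.0], [0.0, 1.0]]], out_size=3)
```

With the sizes given in the text, each 18 x 18 heatmap would be passed through `upsample_bilinear(h, 128)` before the per-pixel argmax.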

Training the CNN. Both the orientation regression CNN of the next Section as well as
the described fingertip CNN are trained using Caffe [O] on an NVidia GTX 980 with an i7
processor and 16GB of onboard RAM. We train both with a Euclidean loss and a batch size
of 100 for 100000 iterations with stochastic gradient descent. We start with a learning rate
of 0.01 and reduce it by a factor of 0.2 after every ten thousand iterations. We found that
repeated fine-tuning was necessary to help network convergence.
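The learning rate schedule above can be written as a one-line step decay. This sketch assumes that "reduce it by a factor of 0.2" means multiplying the rate by 0.2 at each step, as a step policy of the form base_lr * gamma^floor(iter / stepsize) would do; the parameter names are illustrative.

```python
def learning_rate(iteration, base_lr=0.01, gamma=0.2, step=10000):
    """Step decay: multiply the base rate by gamma once per `step` iterations."""
    return base_lr * gamma ** (iteration // step)

# Rate at the start of training and after one and two decay steps.
schedule = [learning_rate(i) for i in (0, 10000, 20000)]
```

Under this reading the rate falls by more than six orders of magnitude over 100000 iterations, which may be one reason repeated fine-tuning was found helpful.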

4 Derotation 

4.1 Orientation regression 

We adapt the deep convolutional architecture from Section 3 to predict full 3 DOF hand
orientation. Instead of a heatmap, we directly predict the 9 coefficients of the rotation
matrix. There are only 3 degrees of freedom in a regular rotation but by using 9 parameters
and a large database we are effectively regularizing our over-parameterized output. The
representation of a rotation matrix in this way is unique in $SO(3)$ unlike quaternions and
Euler angles which we found to be noisy and unreliable. This noise was most visible when
trying to predict a single representative angle. For training we use Euclidean loss and do
not enforce orthonormality. However, the output of this CNN is directly projected onto the
closest orthogonal matrix using the SVD decomposition $R = U \Sigma V^T$. $\hat{R} = U V^T$ then
provides a least squares optimal projection into $SO(3)$, if we additionally enforce
$\det(\hat{R}) = 1$. Figure 5 shows the result of predicting the 9 ground truth coefficients for
HandNet and the full network architecture can be seen in the supplementary material.
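The projection step can be written out explicitly. Below, $R$ denotes the raw 3 x 3 network output and $\hat{R}$ the projected rotation. The determinant correction shown is the standard way of enforcing $\det(\hat{R}) = 1$ and is given here as an assumption, since the text only states that the constraint is enforced.

```latex
\begin{aligned}
R &= U \Sigma V^{T}, \qquad \Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \sigma_3), \quad \sigma_1 \ge \sigma_2 \ge \sigma_3 \ge 0, \\
\hat{R} &= U \operatorname{diag}\!\left(1,\, 1,\, \det(U V^{T})\right) V^{T}, \\
\hat{R} &= \arg\min_{Q \in SO(3)} \lVert Q - R \rVert_F^{2}.
\end{aligned}
```

When $\det(U V^{T}) = +1$ this reduces to $\hat{R} = U V^{T}$. Otherwise the sign of the direction associated with the smallest singular value is flipped, so that the result is a proper rotation rather than a reflection.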
4.2 DeROT: Designing a derotation method


Algorithm 1 Derotation procedure

procedure Derotate(R)
    r_align ← argmax_{r_i} ‖(0, 0, 1) · r_i‖
    if r_align = r_2 (thumb aligned axis) then
        α ← atan2(r_3x, r_3y) + 90 + …
    else
        α ← atan2(r_2x, r_2y) + 90
    end if
    Return α
end procedure



Figure 6: Synthetic and real examples of DeROT. a) The depth projection of the virtual hand
before applying DeROT can be seen on the left wall of the cube representing the camera plane.
The axis marked $r_{orient}$ is projected onto the camera plane and used in DeROT to define the
angle $\alpha$. The purple circle contains the resulting image of the hand after applying
derotation by angle $\alpha$. b) The top row of images are un-derotated. The bottom row have been
derotated by $\alpha$ obtained by DeROT. Note that the thumb is consistently on the right of the
image.

We take advantage of the orientation prediction $R = [r_1\ r_2\ r_3]$ to compute an angle
$\alpha$ which we will use for rotating the camera image about its center. The aim of this is to
reduce pose variance by heuristically forcing the thumb to be on the right side of the image.
We could use a predefined axis and set the angle $\alpha$ with which to rotate the image to be
that between the projection of this axis and the upwards image direction. Unfortunately, when
this axis mostly points to or away from the camera the projection onto the screen will be small
and noisy. As a simple heuristic we detect if this is the case and if so choose an alternative
axis. Specifically we first determine the predicted axis $r_{align}$ most aligned with the camera
z axis as $r_{align} = \arg\max_{r_i} \lVert (0,0,1) \cdot r_i \rVert$. If $r_{align}$ is either the
palm pointing direction or the direction of the extended fingers then we can be sure that the
thumb direction $r_2$ will be non-noisy for this case and set $r_{orient} = r_2$. If the test
yields instead that $r_{align} = r_2$ (i.e. thumb direction is mostly pointing towards or away
from the camera) then we instead set $r_{orient} = r_3$ which is the palm vector. This procedure
is summarized in Algorithm 1. Synthetic and real examples can be seen in Figure 6. This choice
is arbitrary and can be adapted for objects other than the hand. We thus define DeROT to be the
combination of using the CNN from Section 4.1 to predict $R$ together with this derotation
heuristic.
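The axis-selection heuristic of Algorithm 1 can be sketched in a few lines. Two assumptions are made here and should be read as such: the rotation is passed as a list of rows whose columns are $r_1, r_2, r_3$, and the additional term that follows "+ 90" in the thumb-aligned branch of Algorithm 1 is not legible in the source, so it is omitted. The function name is illustrative.

```python
import math

def derotation_angle(R):
    """Return the in-plane angle (degrees) from a 3x3 rotation given as rows.
    Columns of R are the hand axes r1, r2 (thumb) and r3 (palm vector)."""
    cols = [[R[row][col] for row in range(3)] for col in range(3)]
    # Axis most aligned with the camera z axis: largest |(0,0,1) . r_i|,
    # which is simply the largest absolute z component.
    align = max(range(3), key=lambda i: abs(cols[i][2]))
    # Index 1 is r2. If the thumb axis points along z its projection is noisy,
    # so fall back to r3; otherwise use r2.
    orient = cols[2] if align == 1 else cols[1]
    # Assumption: the illegible extra term of the thumb-aligned branch is left out.
    return math.degrees(math.atan2(orient[0], orient[1])) + 90.0

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Rotation whose thumb axis r2 = (0, 0, 1) points along the camera z axis.
thumb_forward = [[1, 0, 0], [0, 0, -1], [0, 1, 0]]
angles = (derotation_angle(identity), derotation_angle(thumb_forward))
```

The returned angle would then be used to rotate the depth crop about the hand's center of mass. The sign convention of that image rotation depends on the image coordinate system and is not fixed by this sketch.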


4.3 Derotation with PCA and Procrustes 

Instead of using DeROT, an alternative approach is to extract the principal axes of the hand
silhouette using PCA and taking the rotation angle of the largest axis to the vertical image
axis. We have found that a similar but more stable option is to determine an enclosing ellipse
using a Procrustes-like algorithm on the convex hull of the points $V$ of the hand segmentation.
The minimum area enclosing ellipse can be found efficiently over the points
$x_i \in \mathrm{convhull}(V)$ by minimizing $-\log(\det(A))$ s.t.
$(x_i - x_c)^T A (x_i - x_c) \le 1$ for $A, x_c$ defining the ellipse. We
solve this using Khachiyan's algorithm [□]. However, as shown in Section 5.1 even with this
added stability the method reduces performance rather than improving it.
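The simpler PCA variant mentioned at the start of this section can be sketched without any linear algebra library, since the leading eigenvector of a 2 x 2 covariance matrix has a closed-form orientation. This illustrates the PCA alternative only, not the enclosing-ellipse method solved with Khachiyan's algorithm; the function name and the convention of measuring from the vertical axis are assumptions.

```python
import math

def principal_axis_angle(points):
    """Angle in degrees between the major PCA axis of 2D points and the
    vertical (y) image axis."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Orientation of the leading eigenvector of [[sxx, sxy], [sxy, syy]],
    # measured from the x axis.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    # Convert to an angle measured from the vertical axis.
    return 90.0 - math.degrees(theta)

vertical = principal_axis_angle([(0, 0), (0, 1), (0, 2)])
horizontal = principal_axis_angle([(0, 0), (1, 0), (2, 0)])
diagonal = principal_axis_angle([(0, 0), (1, 1), (2, 2)])
```

Note that a principal axis is only defined up to a 180 degree flip, so a silhouette-based angle of this kind cannot by itself tell which side of the image the thumb ends up on.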
5 Experiments

5.1 Evaluation protocol and data 

Experiments. We perform our experiments using our HandNet database and the publicly
available database NYUHands [El]. All experiments are performed separately on the two
databases. Our baseline results come from (I) training on non-derotated data and testing on
non-derotated data. We compare this to (II) training on non-derotated data while testing with
derotated data, (III) training on derotated data while testing with non-derotated data, (IV)
training on derotated data while testing with derotated data.

Non-derotated data. For HandNet training we randomly select 202928 images and use 
the remaining 10000 images for testing. For NYUHands we use all 3 camera views (72757 
images per view) for training and the frontal view for testing (8252 images). We slightly 
dilute the training and testing sets according to our hand segmentation pipeline which results 
in 184100 training images and 7241 testing images. 

For experiment types (II) and (IV) we use this data to train two CNN orientation
regression networks, one for each dataset. We use the same data for training the RDT and CNN
fingertip detectors for experiment types (I) and (II). However, for testing the fingertip
detectors in experiments (I) and (III), we rotate each testing image by uniformly random in-plane
rotational offsets between -90 and 90 degrees. This further guarantees that the testing data is
different from the training data.

Derotated data. Experiment types (III) and (IV) use training data which is first derotated
by an Oracle which we define to be DeROT that uses the ground truth $R_{gt}$ obtained from the
magnetic sensors. With experiment types (II) and (IV) we first apply the same uniform
random image rotation to the test images exactly as for experiment types (I) and (III). We
then apply one of the following: (a) Procrustes derotation, (b) DeROT using $R$ predicted by
the CNN regression network, (c) Oracle derotation with $R_{gt}$.

Mean precision and mean average precision. We compute precision and recall according
to the protocol of [S]. We set prediction confidence as the value at the location of the
fingertip detection in the 128 x 128 channel heatmap for each fingertip. The mean precision
(mP) represents the mean precision over all fingertips at a recall rate of 100%. Mean average
precision (mAP) measures the mean of all the areas under the precision-recall curves for
each fingertip and takes into account the behaviour over all confidence values.

Error threshold. The error of a prediction is the distance to the ground truth location. 
If a fingertip is more than 6 pixels from the ground truth position it is considered a false 
positive. The threshold of 6 pixels roughly translates into a distance of 1cm for both HandNet 
and NYUHands in an image patch of size 128 x 128 cropped according to Section 3. 1cm 
is a natural threshold to choose as the distance between adjacent fingertips is over 1.6cm on 
average [□]. 
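To make the evaluation concrete, the sketch below marks a detection as correct when it lies within the 6 pixel threshold and computes a non-interpolated average precision from confidence-ranked detections. The cited protocol may define the precision-recall curve differently (for example with interpolation), so this is an assumption-laden illustration and the function names are invented.

```python
import math

def is_correct(pred, truth, threshold_px=6.0):
    """A detection counts as correct if it is within threshold_px of the truth."""
    return math.hypot(pred[0] - truth[0], pred[1] - truth[1]) <= threshold_px

def average_precision(detections, num_ground_truth):
    """detections: list of (confidence, correct) pairs for one fingertip.
    Returns the non-interpolated area under the precision-recall curve."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, (_, correct) in enumerate(ranked, start=1):
        if correct:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / num_ground_truth if num_ground_truth else 0.0

# Toy example: three detections of one fingertip across frames, two ground truths.
near = is_correct((10, 10), (13, 14))   # 5 pixels away
far = is_correct((10, 10), (10, 17))    # 7 pixels away
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
```

Averaging such per-fingertip values over the five fingertips would give an mAP in the sense used in this section.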

5.2 Discussion 

The results of the experiments can be seen in Table 1. In Figure 7 we display a
precision-recall curve and error threshold graph for the thumb on the HandNet test set for all
experiment types, which is representative of the behavior of all fingertips. The results show that
the use of DeROT improves over the baseline results for all measurements for both RDT
and CNN for experiments on both datasets. On HandNet, when training an RDT and CNN
on ground truth derotated data, we see that test-time use of DeROT yields relative improvements
in mAP of 11.3% and 20.5% over the respective baselines. For NYUHands, DeROT gives an
RDT a gain of 17.3% in mAP when trained on derotated data and a CNN achieves mAP
gains of 14.2% when trained on underotated data but only a marginal gain of 2.5% when
trained on derotated data. We found that the confidence values for this specific case were not
reliable (which directly affects mAP) because of confusion between fingertips (specifically
index and ring), which further justified the creation of HandNet. For all experiments and
datasets the mP when using DeROT shows improvements of between 7.8% and 21.1% on
underotated training data and between 23.5% and 38.6% for derotated training data. The
simplistic Procrustes derotation negatively impacts fingertip detection relative to the
baseline and we therefore chose not to build and train an RDT and CNN on Procrustes derotated
versions of the two datasets. For our experiments a single RDT mostly outperforms a CNN.
Although they are trained with different data and objectives it hints that there is no silver
bullet to determining which machine learning approach is more appropriate.

Test set derotation method             None           (a) Procrustes   (b) DeROT      (c) Oracle
                                       mP     mAP     mP     mAP       mP     mAP     mP     mAP
HandNet
  RDT trained on non-derotated data    0.51   0.79    0.49   0.77      0.55   0.85    0.60   0.87
  RDT trained on derotated data        0.32   0.60    -      -         0.63   0.88    0.75   0.95
  CNN trained on non-derotated data    0.44   0.73    0.42   0.73      0.46   0.77    0.50   0.79
  CNN trained on derotated data        0.30   0.59    -      -         0.61   0.88    0.74   0.95
NYUHands
  RDT trained on non-derotated data    0.51   0.75    0.47   0.73      0.58   0.84    0.61   0.86
  RDT trained on derotated data        0.35   0.58    -      -         0.63   0.88    0.68   0.89
  CNN trained on non-derotated data    [illegible]    0.36   0.69      0.46   0.80    0.48   0.81
  CNN trained on derotated data        0.23   0.42    -      -         0.49   0.72    [illegible] 0.73

Table 1: Results of our experimental evaluation for all experiment types described in Section 5.1.
The experiment types are highlighted in red, pink, blue and cyan respectively. A result in bold
indicates that it outperforms the baseline (I) shown in red. For each row pair (derotated
training data vs non-derotated training), the underlined result is the better of the two.
Procrustes consistently reduces the quality of fingertip detection. Conversely, DeROT
outperforms the baseline for every experiment. For all but one experiment, this improved
performance is significantly enhanced by training on derotated data instead of original data.
See Section 5.2. The results from the Oracle serve as an upper bound achievable by derotation.

[Figure 7 panels: (a) RDT, (b) CNN]

Figure 7: These graphs show typical precision to recall and precision to error threshold for
thumb detection using RDT and CNN on the HandNet test set. Each line indicates an experiment
which is labeled in the legend using the experiment types from Section 5 and the derotation
types Procrustes (a), DeROT (b), Oracle (c). The baseline is in red. Training on derotated data
and then applying DeROT or Oracle is in cyan or green respectively. Training on non-derotated
data and then applying DeROT or Oracle is in magenta or black respectively. The average
precision (AP) and precision at 1cm error (P@1cm) are shown for each thumb experiment.
6 Conclusions and future work

We have shown that using derotation, specifically DeROT, significantly improves the
localization ability of machine-learning based per-frame fingertip detectors by reducing the
variance of the pose space. Furthermore we find that this procedure works despite the extremely
high range of potential poses. We see this approach as an alternative to data augmentation
and as a potentially useful additional step in pipelines dedicated to pose extraction for
articulated objects such as hands. Although we have used no prior model or kinematic constraints
to improve the detection results, this is currently an active area that we are investigating.
Also, in this work we have considered results only on depth images but it would be interesting
to apply a similar pipeline to pure 2D color images.